Abstract

Methods of efficiently generating and classifying 
samples with specified multivariate normal distributions are 
discussed. Conservative confidence tables for sample sizes 
are given for selective sampling. Simulation results are 
compared with classified training data. Techniques for com- 
paring error and separability measures for two normal pat- 
terns are investigated and used to display the relationship 
between error and the Chernoff bound. 


The work described in this paper was supported by National 
Aeronautics and Space Administration Grant No. NGL 15-005-112. 

INTRODUCTION


There has been a significant amount of effort devoted 
to the design and evaluation of functions which "measure" 
the relative effectiveness of statistical pattern recogni- 
tion schemes in classifying data. Two of the more notable 
ones are the Bhattacharyya distance [5] (a special case, for
pairs of distributions, of the subsequent Chernoff bound
[6; 16, pp. 116-126]) and the divergence [7,8]. The motivation
for these "distance measures" is that in some cases, theo- 
retical recognition error cannot be obtained easily. In 
the case of the normal assumption, the error expression is 
generally difficult if not impossible to evaluate analytical- 
ly. A technique [9,10] has been developed for obtaining 
theoretical error in a two-class problem using a Bayes de- 
cision rule and gaussian assumption. But error in 
recognition problems with an arbitrary number of normal 
classes has not in general been expressed in a manner which 
can be analyzed easily. 

Because of this problem, "distance" measures and bounds 
have great appeal. In multiple-class problems, some sort 
of average of the distance between pairs of classes often 
is used as a performance measure of various classification 
schemes (such as selecting feature sets).

The ability of a separability measure to predict performance
in statistical pattern recognition ultimately depends on its
relationship with theoretical error. Some relationships
between error and the Bhattacharyya distance
and divergence are known [11,12,13,14]. These relationships 
are in the form of bounds on error. For the two cited 
separability measures, the most important relationship is 
that two-class error is bounded by one-half of the Bhattacharyya
coefficient [12], and accuracy (one minus error) for
two normal classes appears to be bounded above and below by
an empirical relationship with the divergence described in
[15]. From this empirical relationship, it appears that
probability of correct recognition is less than or equal to
the value of the normal distribution function at one-half
of the square root of the divergence. That is,

$$P_c \le \operatorname{erf}_*\!\left(\sqrt{D}/2\right), \tag{1}$$

although this has not been proven yet.
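
The right-hand side of (1) is easy to evaluate once the divergence is known. The following is a minimal sketch for the univariate case, assuming equal a priori probabilities; the function names `divergence_1d` and `divergence_bound` are hypothetical and are not taken from the references. For equal variances the exact recognition rate is available in closed form and coincides with the bound, which gives a simple check.

```python
import math
from statistics import NormalDist


def divergence_1d(mu1, var1, mu2, var2):
    """Divergence between the univariate normal densities N(mu1, var1) and N(mu2, var2)."""
    cov_term = 0.5 * (var1 / var2 + var2 / var1 - 2.0)
    mean_term = 0.5 * (1.0 / var1 + 1.0 / var2) * (mu1 - mu2) ** 2
    return cov_term + mean_term


def divergence_bound(d):
    """Right-hand side of (1): the normal distribution function at sqrt(D)/2."""
    return NormalDist().cdf(math.sqrt(d) / 2.0)


# Equal variances, equal priors (assumption): the Bayes rule thresholds at the
# midpoint, so the exact recognition rate is Phi(|mu1 - mu2| / (2 * sigma)).
d_equal = divergence_1d(0.0, 1.0, 2.0, 1.0)
exact_equal = NormalDist().cdf(abs(0.0 - 2.0) / 2.0)
bound_equal = divergence_bound(d_equal)

# Unequal variances: only the conjectured bound is evaluated here.
d_unequal = divergence_1d(0.0, 1.0, 0.0, 4.0)
bound_unequal = divergence_bound(d_unequal)
```

In the equal-variance case the two values agree, so any slack in (1) can only come from differences between the covariance matrices.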

It is interesting to note that in [15], the paper to
which much of the motivation for use of divergence has been 
attributed, part of the relationship between divergence and 
accuracy was obtained using a Monte-Carlo type of simula- 
tion. It seems apparent, in looking over some of the litera- 
ture dealing with these and other error bounds, that a 
simulation type of analysis would have something to offer in 
understanding the relationship between error and these bounds. 
Many advances in the use of error bounds have improved (at
a cost of high mathematical complexity in some cases) error 
prediction in very specific areas (for instance, the use 
of the Chernoff bound in information theory and likelihood 
decoding error analysis [17, pp. 131-135; 18, pp. 394-398; 
19]). In the case of two gaussian distributions, one of 
the tightest known bounds on error which can be easily 
evaluated, the Chernoff bound, is "close" in predicting 
error for only special cases (such as in [16, pp. 126-133]). 
For more general two-class problems such as the one used 
in [15], an example in [10, p. 73] shows a case where this 
bound does insignificantly better than the Bhattacharyya 
coefficient (tightest known bound for normal data which can 
be expressed explicitly), which, in this case, isn't very 
close to actual error. Experience has shown that this is 
often the case in data from natural patterns such as multi- 
spectral data [1] modeled by the normal distribution. 

In many problems, however, it is not so important that 
a distance measure bound error, as it is that it should tend 
to indicate which classification scheme is best (not 
necessarily the same thing). This is especially important
in the case of multiple-hypothesis pattern recognition, be- 
cause even the tightest bounds lose most of their "potency" 
when they are averaged over all pairs of classes [20]. Also, 
measures which aren't averages over class pairs have yet to 
yield any analytic simplicity [2]. However, if one
separability measure has a weaker relationship with theoretical
error than another, it must be considered as a less
reliable source of separability information. 

Simulation can provide a useful relationship between 
specific classification problems and the numbers produced 
by separability measures. For instance, the average di- 
vergence might be used to narrow a large number of feature 
sets down to several which have the highest value. Then, 
rather than classify the training samples using these fea- 
ture sets and compare (especially if this is physically 
cumbersome), one might generate and classify samples with
the same distribution as the training classes. Or, it may be 
the case that a researcher requires easy access to a large 
number of samples with a specific distribution in order to 
make a carefully controlled comparison of classification 
error and separability measures. 

The major disadvantage, when compared to most separa- 
bility measures, is the amount of machine time used to 
classify the samples. Also, the method is Monte-Carlo and 
not exact. Hence the degree of confidence varies with the 
number of samples used. These two drawbacks will be examined 
in this note. Also, certain properties of pairs of normal
patterns are used to reduce the size of the sample space of
mean vectors in case the relationship between recognition
rate and other pair-wise separability measures is to be
studied. Examples of all of the techniques are presented.
Much of the material is tutorial in nature, but provides a
necessary background for the methods described.

* In a forthcoming paper a new statistic for error will be
introduced for cases where distributions are specified.

A THEORETICAL BASIS FOR SIMULATION AND 
CONFIDENCE BOUNDS FOR THE RESULTS 

If one has available samples from the mixture density, a 
method of estimating error in using a decision rule which 
partitions the sample space is well known [4]. This method, 
random sampling or error counting, does not give estimates 
of conditional class error. However, precise confidence 
tables are available [22,4,10, p. 147] for computing sample 
size. Another method known as selective [4] or stratified 
[3, p. 255] sampling does yield these estimates and has an 
estimate for error with smaller variance than random 
sampling. Some conservative confidence tables are now de- 
veloped for selective sampling. No assumption of class 
distributions is made. 

Suppose that one has $N_i$ samples from class i, and that
the classification scheme under consideration classifies $L_i$
of these samples correctly. Let $P_{ci}$ be the conditional
probability of correct classification for class i using this
scheme. Since $L_i$ is binomial with parameters $N_i$ and $P_{ci}$, it
is well known that the maximum likelihood estimate $\hat{P}_{ci}$ for
$P_{ci}$ is

$$\hat{P}_{ci} = L_i / N_i \tag{2}$$

and is unbiased. Further, suppose that there are M classes in this
particular example, and that $N_i = P_i N$, where $P_i$ is the a
priori probability of class i, so that a total of N samples
are used. Then the maximum likelihood estimate (see [10,
pp. 145-148] and [27, pp. 47-48]) for the overall probability
of correct classification $P_c = \sum P_i P_{ci}$ is

$$\hat{P}_c = \sum_{i=1}^{M} P_i \hat{P}_{ci} = \frac{1}{N} \sum_{i=1}^{M} L_i \tag{3}$$

and is unbiased. The absolute error in $\hat{P}_c$ is $|\hat{P}_c - P_c|$, and the
variance of $\hat{P}_c$ is $\sum P_i P_{ci}(1 - P_{ci})/N$ [10]. Using a basic inequality
of probability theory [21, p. 157], it can be shown that
for any $\delta > 0$,

$$P\{|\hat{P}_c - P_c| \ge \delta\} \le \frac{\sum P_i P_{ci}(1 - P_{ci})}{N \delta^2} = B_1 \tag{4}$$

(all summations from 1 to M). That is, the probability that
the estimate $\hat{P}_c$ differs from the actual value $P_c$ by $\delta$ or
more is bounded by $B_1$. But note that $B_1$ depends on the
individual $P_{ci}$. If these were known, $P_c$ could be computed exactly.

A confidence bound with no dependence on the individual
$P_{ci}$ may be easily obtained by noting that

$$\max_{P_{ci}} \left[ \sum P_i P_{ci}(1 - P_{ci}) \right] = \frac{1}{4}. \tag{5}$$

Hence

$$P\{|\hat{P}_c - P_c| \ge \delta\} \le \frac{1}{4 N \delta^2} = B_2. \tag{6}$$

So we use $N = 1/(4 B_2 \delta^2)$.


As an example, suppose that for a ten class problem, it
is desired that the error in the estimate be greater than
0.01 not more than 5% of the time. This corresponds to a
95% confidence that $\hat{P}_c$ is within 0.01 of $P_c$. For $B_2 = 0.05$,
M = 10, equal priors $P_i$, and $\delta = 0.01$, we find N = 50,000
(5000 per class) samples required. For comparison, the
assumption that $P_{ci} = 0.8$ and use of $B_1$ would result in the
requirement that 32,000 samples be used.
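
The sample sizes quoted in this example follow directly from (4) and (6). A minimal sketch of the arithmetic is given below; the function names are hypothetical, and the rounding step is only a guard against floating-point noise, not part of the method.

```python
import math


def samples_worst_case(b2, delta):
    """Total samples N from (6): N = 1 / (4 * B2 * delta^2)."""
    return math.ceil(round(1.0 / (4.0 * b2 * delta ** 2), 6))


def samples_known_rates(priors, p_correct, b1, delta):
    """Total samples N from (4) when the conditional rates P_ci are assumed known."""
    spread = sum(p * q * (1.0 - q) for p, q in zip(priors, p_correct))
    return math.ceil(round(spread / (b1 * delta ** 2), 6))


# Ten classes, equal priors, 95% confidence of being within 0.01.
n_conservative = samples_worst_case(0.05, 0.01)
n_assumed = samples_known_rates([0.1] * 10, [0.8] * 10, 0.05, 0.01)
```

With these inputs the two functions reproduce the figures of 50,000 and 32,000 samples given above.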

It might be noted that some similarity exists between 
this confidence expression and the classic confidence tables 
of [22] for random sampling. In the case of the latter, 
however, it is known that the distribution of the error in 
the estimate is binomial. This allows one to construct a 
much tighter confidence interval (or, looking at it another
way, use fewer samples). $|\hat{P}_c - P_c|$ is in general binomial
only for M = 2 in this paper. Further, the confidence of
$1 - B_2$ (which is $\le P\{|\hat{P}_c - P_c| < \delta\}$) corresponds to the
interval

$$\hat{P}_c - \frac{1}{2\sqrt{N B_2}} \le P_c \le \hat{P}_c + \frac{1}{2\sqrt{N B_2}}.$$

In the classic confidence tables, these intervals are not
symmetric unless $P_c = 0.5$. As an example, let $P_c = 0.5$.
For M = 2 and 95% confidence that the error does not exceed
0.05 ($\hat{P}_c$ within 0.05 of $P_c$ at least 95% of the time), we
require 2000 samples using $B_2$ (and $B_1$), while only about 400
samples are required using the knowledge that $\hat{P}_c$ is
binomial.

A graph of error $\delta$ versus the total number of samples
N is presented in Figure 1 for confidence levels of 75, 90,
95, and 99%. A log-log scale is used in order to present
a useful range of values. Because of the conservative nature
of the bound, modest choices of $\delta$ and confidence level may
lead to large sample sizes. In fact the 95% confidence line
for random sampling with $P_c = 0.5$ would lie just above the 75%
line in Figure 1, even though the variance of the selective
sampling statistic is, in general, smaller. However, if
one needs the estimates $\hat{P}_{ci}$, the latter statistic is more
convenient (one may always use the tables of [22] to compute
confidence in the individual $\hat{P}_{ci}$), and does not require
randomization on the class numbers.

When sample sizes are large, an approximation may be
used. For fixed M and increasing N, the distribution function
of $\hat{P}_c$ tends to become normal regardless of the values
of the $P_{ci}$ [21,29, pp. 256-257]. The N+1 discontinuities in
the distribution become small "jumps." The confidence
approximation using (5) becomes

$$P\left\{ |\hat{P}_c - P_c| \ge \frac{z_\alpha}{2\sqrt{N}} \right\} \le 1 - \alpha = B_3,$$

where $100\alpha$ is the percent confidence and $z_\alpha$ is the value at which
the normal distribution function is $1 - (1 - \alpha)/2$. Now we use
$N = z_\alpha^2/(4\delta^2)$. Figure 2 gives the resulting relationship
for 75, 90, 95 and 99% confidence. In the example above
for M = 10, we find that $B_3$ yields 9600 samples required
($B_1$ 32,000; $B_2$ 50,000). In the other example for M = 2, we
get 385 ($B_1$, $B_2$ 2000; binomial 400). The latter example
points out the need for large sample sizes in using $B_3$. If
M is increased, even larger sizes are probably needed.
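
The normal-approximation sample sizes can be reproduced in a few lines. This is a sketch only: it assumes the worst-case variance of (5) and uses the Python standard library's `NormalDist.inv_cdf` for the normal quantile rather than printed tables; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def samples_normal_approx(confidence, delta):
    """Total samples N = z^2 / (4 * delta^2), with z the two-sided normal quantile."""
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return math.ceil(z * z / (4.0 * delta ** 2))


n_ten_class = samples_normal_approx(0.95, 0.01)   # example with M = 10
n_two_class = samples_normal_approx(0.95, 0.05)   # example with M = 2
```

The first value is 9604, which the text rounds to 9600; the second is 385.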


EFFICIENT GENERATION AND CLASSIFICATION OF NORMAL SAMPLES


Let us assume that a source of independent, normally 
distributed samples is available. Such a source can be 
approximated by using a power-residue technique to generate 
pseudo-random samples with approximately uniform distribution. 
Sets of these samples may then be normalized in accordance 
with the central-limit theorem to produce approximately 
normal samples. One of the most commonly used techniques 
employing this procedure is described in [23, pp. 94-96]
(this reference describes the theoretical basis for the
algorithm used on IBM/360 computers in the SSP subroutine
RANDU). Samples generated using this method have little
sample correlation [24]. Another well known method is the
inverse method. It is faster than the above (using typical
set sizes), requires only one uniformly distributed sample,
and, for all practical purposes, is not truncated. Let the
random variable X be uniformly distributed on the interval
from zero to one. Let $F(\cdot)$ represent the desired distribution
with inverse $F^{-1}(\cdot)$. Then $Y = F^{-1}(X)$ has distribution
$F(\cdot)$. For F normal, good approximations are available
[26, pp. 191-192; see SSP subroutine NDTRI]. This reference
[26] is the reason the method for normal F is sometimes
called Hastings' method. Other fast procedures are given in
[27, pp. 90-95].
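
Both generation methods can be sketched briefly. The code below is illustrative only: it substitutes Python's built-in uniform generator for the power-residue generator and the standard library's `NormalDist.inv_cdf` for the Hastings approximation, and the set size of 12 uniform samples is an assumption rather than a value taken from [23].

```python
import math
import random
from statistics import NormalDist, fmean, pvariance


def normal_by_sum(rng, k=12):
    """Central-limit method: normalize the sum of k uniform samples."""
    total = sum(rng.random() for _ in range(k))
    return (total - k / 2.0) / math.sqrt(k / 12.0)


def normal_by_inverse(rng):
    """Inverse method: Y = F^{-1}(X) with X uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:          # inv_cdf requires 0 < u < 1
        u = rng.random()
    return NormalDist().inv_cdf(u)


rng = random.Random(7)
sum_samples = [normal_by_sum(rng) for _ in range(20000)]
inverse_samples = [normal_by_inverse(rng) for _ in range(20000)]

sum_mean, sum_var = fmean(sum_samples), pvariance(sum_samples)
inv_mean, inv_var = fmean(inverse_samples), pvariance(inverse_samples)
```

Note that the sum method with k = 12 can never produce a magnitude above 6, which is the truncation referred to above, and that it consumes k uniform samples per normal sample where the inverse method consumes one.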


Designate a normal density for class i with n by 1 mean
vector $M_i$ and n by n covariance matrix $K_i$ as $N(M_i, K_i)$. Let
$Q_i$ be an orthogonal transformation which diagonalizes $K_i$ as

$$Q_i^t K_i Q_i = \Lambda_i = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \tag{9}$$

where $\Lambda_i$ is the n by n diagonal matrix of eigenvalues $\lambda_k$ for
$K_i$ (so that $Q_i$ is a matrix with eigenvectors of $K_i$ for its
columns [25, pp. 80-99]). Form n by 1 random vectors X with
density N(0,I) by taking n normal samples with zero mean
and unit variance and use them for the components of X. If
we define

$$\Lambda_i^{1/2} = \begin{bmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_n} \end{bmatrix} \tag{10}$$

then the random vector $Y_i = Q_i \Lambda_i^{1/2} X + M_i$ is $N(M_i, K_i)$.
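
A small sketch of this construction for n = 2 follows. To stay self-contained it uses the closed-form eigendecomposition of a symmetric 2 by 2 matrix in place of a general eigenvalue routine, and it draws the N(0,1) components with the standard library's `random.gauss`; all function names and the example mean and covariance are hypothetical.

```python
import math
import random


def eig_sym_2x2(k):
    """Eigenvalues and orthogonal eigenvector matrix Q of a symmetric 2x2 matrix."""
    a, b, c = k[0][0], k[0][1], k[1][1]
    radius = math.hypot((a - c) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    lam = [(a + c) / 2.0 + radius, (a + c) / 2.0 - radius]
    q = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    return lam, q


def mixing_matrix(cov):
    """A = Q Lambda^(1/2): column c of Q is scaled by sqrt(lambda_c)."""
    lam, q = eig_sym_2x2(cov)
    return [[q[r][c] * math.sqrt(lam[c]) for c in range(2)] for r in range(2)]


def draw_sample(mean, a, rng):
    """Y = A X + M with X ~ N(0, I)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]
    return [a[r][0] * x[0] + a[r][1] * x[1] + mean[r] for r in range(2)]


mean = [1.0, 2.0]
cov = [[4.0, 1.5], [1.5, 1.0]]
a = mixing_matrix(cov)
rng = random.Random(1)
samples = [draw_sample(mean, a, rng) for _ in range(5)]

# Check of the construction: A A^t should reproduce the covariance matrix.
aat = [[sum(a[r][k] * a[c][k] for k in range(2)) for c in range(2)]
       for r in range(2)]
```

Since Y = AX + M has covariance A A^t, the final check confirms that the scaled and rotated samples carry the desired covariance.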

These steps may be graphically depicted as the illustration
in Figure 3 shows. Cross-sections (level surfaces
or surfaces of constant probability) of densities for n=2
are shown which are, in this case, ellipses. Figure 3a
represents the desired distribution. $\Lambda_i^{1/2}$ scales the samples
to obtain variances which correspond to those of the principal
components of the original covariance matrix. $Q_i$ rotates
the samples (or coordinate system) until the principal
components of the density are parallel to those of the
desired density (same correlation between features). Adding
$M_i$ locates the mean at the desired value.

This method is quite straightforward, but very time
consuming. Each vector X must be multiplied by the matrix
$Q_i \Lambda_i^{1/2}$ for a total of $n^2$ multiplications and $n^2$ additions for
each X. Adding $M_i$ to each X could be eliminated by shifting
all of the distributions by that amount. This also raises
the possibility of classifying the X samples directly in a
transformed feature space rather than transforming them first
and classifying in the original space. Since classification
error is invariant under linear transformations and shifts,
and $\hat{P}_{ci}$, $\hat{P}_c$ are the only desired results, we note that

$$X = \Lambda_i^{-1/2} Q_i^t (Y_i - M_i) \tag{11}$$

is N(0,I) and transform all of the other class parameters
as (see Appendix A)

$$K_j^* = \Lambda_i^{-1/2} Q_i^t K_j Q_i \Lambda_i^{-1/2} \tag{12}$$

$$M_j^* = \Lambda_i^{-1/2} Q_i^t (M_j - M_i). \tag{13}$$



Thus we can use N(0,I) samples directly to represent
class i and obtain $\hat{P}_{ci}$ by classifying these samples using
the above expressions for the other covariance matrices and
mean vectors. This process can be characterized as a transformation
of the feature space to fit the samples, rather
than a transformation of the samples to fit the feature
space (although the two are equivalent). In other words, the
feature space is transformed in a manner analogous to going
backwards in Figure 3 from 3e to 3b.
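
The parameter transformation of (12) and (13) can be illustrated for n = 2. The sketch below is self-contained and uses a closed-form 2 by 2 eigendecomposition; the helper names and the two example classes are hypothetical. Transforming class i by its own whitening matrix should give zero mean and identity covariance, which provides a check.

```python
import math


def eig_sym_2x2(k):
    """Eigenvalues and orthogonal eigenvector matrix Q of a symmetric 2x2 matrix."""
    a, b, c = k[0][0], k[0][1], k[1][1]
    radius = math.hypot((a - c) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    lam = [(a + c) / 2.0 + radius, (a + c) / 2.0 - radius]
    q = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    return lam, q


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def whitening_matrix(cov_i):
    """W = Lambda_i^(-1/2) Q_i^t, so that W K_i W^t = I."""
    lam, q = eig_sym_2x2(cov_i)
    return [[q[c][r] / math.sqrt(lam[r]) for c in range(2)] for r in range(2)]


def transform_class(w, mean_i, mean_j, cov_j):
    """Return (M_j*, K_j*) as in (13) and (12)."""
    k_star = matmul(matmul(w, cov_j), transpose(w))
    diff = [mj - mi for mj, mi in zip(mean_j, mean_i)]
    m_star = [sum(w[r][c] * diff[c] for c in range(2)) for r in range(2)]
    return m_star, k_star


def det2(k):
    return k[0][0] * k[1][1] - k[0][1] * k[1][0]


mean_i, cov_i = [1.0, 2.0], [[4.0, 1.5], [1.5, 1.0]]
mean_j, cov_j = [3.0, 1.0], [[2.0, 0.0], [0.0, 3.0]]
w = whitening_matrix(cov_i)
m_i_star, k_i_star = transform_class(w, mean_i, mean_i, cov_i)
m_j_star, k_j_star = transform_class(w, mean_i, mean_j, cov_j)
```

The transformed parameters are computed once per class pair, after which every N(0,I) sample can be classified without any per-sample matrix multiplication for generation.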

It might also be noted that the normalizing process 
used to obtain N(0,I) samples from uniformly distributed 
samples could be incorporated into this procedure to eliminate
more unnecessary computations (e.g., don't normalize).
Also, $\log|K_j^*|$ (which is used in the decision rule) is just
$\log|K_j| - \log|K_i|$, so that these need be computed only once
for the entire simulation.


Applications for Normal Data

The normal assumption appears to work reasonably well
in classifier designs when applied to agricultural categories
of multispectral data [1]. Recently, a powerful test
of normality was developed and used on this data, the results
of which lead one to believe that in some cases, the assumption
is not unreasonable [28]. Using the same data with
classes defined in [1], an experiment was conducted to compare
the results of estimating $P_c$ by simulation with the
value obtained by classifying training samples, using statistics*
obtained from those samples. Eight classes (corn,
soybeans, wheat, alfalfa, bare soil, oats, clover, rye) were
used with 12 features (wavelength bands). One thousand
samples per class were generated by the methods described
above for each of the feature sets {1}, {1,2}, ..., {1,2,...,12}.
The results for estimating overall error, $P_e = 1 - P_c$, and
conditional error, $P_{ei} = 1 - P_{ci}$, for the class wheat are
given in Figure 4, a and b. Agreement seems to be fairly
good, with simulation results appearing more optimistic in
terms of accuracy, as might be expected (the generated data
should fit the normal assumption better).

* Maximum likelihood for mean and covariance, with bias correction
applied to the latter.

TWO CLASS PROBLEMS

Simulation studies of two class separability measures
for certain types of distributions may yield useful information
for classifier design. An example is given in [15],
where recognition rate is compared to divergence values for
two normal patterns. Knowledge of the behavior of such
measures may allow the researcher to define new measures
for M class problems which improve performance in feature
selection.*

For normal patterns, it is well known that both covariance
matrices may be simultaneously diagonalized, one into
the identity matrix. Then the transformed means of these
classes may be shifted so that the class with identity covariance
has its mean at the origin. Thus, all cases of
pairs of normal patterns may be simulated by considering
only classes with diagonal covariance matrices, one equal
to the identity with zero mean vector. In
this case computation of separability measures such as the
Chernoff and Bhattacharyya bounds, divergence, and even true
error** are relatively straightforward ([10, pp. 72, 284, 62-64],
respectively). One need generate values for the parameters
of the class with arbitrary mean vector only.

* A forthcoming paper will explore this topic.

** Changing the sign in Equation 3-51 and 3-52 from + to

The major problem is the number of samples from the
parameter space (of mean and variance components) needed to
obtain representative results. Obvious symmetry (in the
sense of error) allows the use of only non-negative mean
components. Yet another type of symmetry exists. We see
from Figure 5a that there is reflective symmetry about the line
of equal mean components. Here a two-feature example is
sketched to show that for every set of mean and variance
components chosen in the subset of non-negative mean components,
a simple permutation of these component values yields
a different distribution with the same error, still contained
in this subset. Proceeding to the general case of n features,
it is apparent that this property implies that only mean
vectors with monotone components need be considered.
Since there are $2^n$ combinations of signs for the components
of an arbitrarily chosen mean vector, and because the restriction
to positive signs leaves n! choices of inequalities
between components (fix $m_1$ on the real line, leaving two
places for $m_2$, three for $m_3$, etc.), the restriction of, say,
$m_1 \ge m_2 \ge \dots \ge m_n \ge 0$ reduces the size of the set of possible
mean vectors with components restricted in magnitude by a
factor of $1/(n!\,2^n)$. Figure 5b depicts this process for n=3
and $a \ge m_1 \ge m_2 \ge m_3 \ge 0$.
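
The two symmetries can be expressed as a mapping that sends any parameter set to its representative in the reduced region. This is a sketch under the setting described above (one class N(0,I), the other with diagonal covariance); the function names are hypothetical.

```python
import math


def canonical_parameters(means, variances):
    """Map (means, variances) to the equivalent set with m_1 >= ... >= m_n >= 0.

    Sign changes and joint permutations of (mean, variance) components leave
    the error unchanged when the other class is N(0, I).
    """
    pairs = [(abs(m), v) for m, v in zip(means, variances)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [m for m, _ in pairs], [v for _, v in pairs]


def reduction_factor(n):
    """Fraction of the mean-vector space that must be sampled: 1 / (n! 2^n)."""
    return 1.0 / (math.factorial(n) * 2 ** n)


rep_means, rep_variances = canonical_parameters([-1.0, 3.0, 2.0], [4.0, 0.5, 9.0])
factor_three = reduction_factor(3)
```

For n = 3 the factor is 1/48, so only about two percent of the bounded mean-vector space needs to be sampled.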

The method is readily applied to experiments where an
attempt to pre-determine covariance and mean values is desired.
These values may be incremented by a fixed amount
over a range of numbers, so as to insure that representative
combinations are covered (one objection to a
random approach). Generating random components is a bit more
difficult if one desires a uniform distribution on the set of
possible mean components. This would involve more complicated
software to compute the assignment of probability mass
for successive mean components conditioned on the value of
a previous one. Experience has shown that order statistics
or random walks ($m_i$ uniform on ... to $a$) give satisfactory
results.
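
The order-statistics option is simple to sketch: draw n independent uniform values on the allowed range and sort them. The function name and the seed are hypothetical, and Python's built-in generator stands in for whatever uniform source is available.

```python
import random


def monotone_means(n, upper, rng):
    """Order statistics of n uniform samples on [0, upper], largest first."""
    return sorted((rng.uniform(0.0, upper) for _ in range(n)), reverse=True)


rng = random.Random(0)
mean_components = monotone_means(4, 6.0, rng)
```

Sorting independent uniform draws in fact yields a vector uniformly distributed over the ordered region, since each of the n! orderings of the components is equally likely.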

As an example, 40,000 sets of parameters were generated,
1,000 each for sets of 2, 3, and 4 components, and 37,000
for one component (due to time considerations in computing
error for n > 1), and both $P_c$ and the Chernoff bound were
computed. The result is given in Figure 6. Order statistics
for uniformly distributed random numbers on the
interval from 0.0 to 6.0 were used to obtain mean components.
Variance values were obtained from numbers uniform on 0.01
to 25.0. $P_c$ was computed using the method of [10].
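
For the reduced parameterization (one class N(0,I), the other with mean vector m and diagonal covariance), the Chernoff exponent separates into a sum over components. The sketch below assumes the standard expression for the Chernoff exponent of two normal densities, equal a priori probabilities, and a simple grid search over s in place of a proper one-dimensional maximization; the function names are hypothetical.

```python
import math


def chernoff_mu(s, means, variances):
    """Chernoff exponent mu(s) for N(0, I) versus N(means, diag(variances))."""
    total = 0.0
    for m, v in zip(means, variances):
        mix = s + (1.0 - s) * v                      # s*K1 + (1-s)*K2 on this axis
        total += 0.5 * s * (1.0 - s) * m * m / mix   # mean term
        total += 0.5 * (math.log(mix) - (1.0 - s) * math.log(v))  # covariance term
    return total


def bhattacharyya_distance(means, variances):
    """Special case s = 1/2."""
    return chernoff_mu(0.5, means, variances)


def chernoff_distance(means, variances, steps=1000):
    """Largest mu(s) on a uniform grid of s in (0, 1)."""
    return max(chernoff_mu(k / steps, means, variances) for k in range(1, steps))


def chernoff_error_bound(means, variances):
    """Upper bound on two-class error, assuming equal priors."""
    return 0.5 * math.exp(-chernoff_distance(means, variances))


c_equal = chernoff_distance([2.0], [1.0])
c_general = chernoff_distance([1.0, 0.5], [4.0, 0.25])
b_general = bhattacharyya_distance([1.0, 0.5], [4.0, 0.25])
```

With equal covariance matrices the maximum occurs at s = 1/2 and the Chernoff and Bhattacharyya distances coincide; otherwise the Chernoff distance is at least as large.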

One interesting possibility raised by the above example
is that a relationship between the Chernoff distance C (minus
the log of the coefficient) and $P_c$, similar to that of the
divergence, may exist. For equal covariance matrices,

$$P_c = \operatorname{erf}_*\!\left(\sqrt{2C}\right). \tag{14}$$

Plotting the right hand side of (14) with $P_c$ yields Figure 7,
suggesting that

$$P_c \le \operatorname{erf}_*\!\left(\sqrt{2C}\right), \tag{15}$$

which has not been proven. A check of the numbers
generated has thus far established empirical agreement
with (15).
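
The equal covariance case can be worked out in a few lines, which also ties the Chernoff relationship back to (1). The derivation below is a sketch assuming equal a priori probabilities and the standard form of the Chernoff exponent for two normal densities with common covariance matrix K; the symbols $\Delta$ and $\mu(s)$ are introduced here for illustration only.

$$
\begin{aligned}
\Delta^2 &= (M_1 - M_2)^t K^{-1} (M_1 - M_2), \\
\mu(s) &= \tfrac{1}{2}\, s(1-s)\, \Delta^2, \qquad
C = \max_{0 \le s \le 1} \mu(s) = \mu(\tfrac{1}{2}) = \Delta^2 / 8, \\
D &= \Delta^2, \\
P_c &= \operatorname{erf}_*(\Delta/2) = \operatorname{erf}_*\!\left(\sqrt{2C}\right) = \operatorname{erf}_*\!\left(\sqrt{D}/2\right).
\end{aligned}
$$

In this case the conjectured bounds in terms of the divergence and the Chernoff distance are both attained with equality.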

Summary and Conclusion 

Motivation for the use of Monte-Carlo type simulation 
in the study of classifier design includes avoiding the dif- 
ficulty in obtaining error exactly, and the desire to obtain 
relationships between error and separability measures for 
various classes of density functions. Selective sampling 
was reviewed and conservative confidence bounds for sample 
sizes developed. The confidence relationships are weaker 
than those for random sampling. However, random sampling 
does not provide controlled size estimates of conditional 
class errors. Methods of generating and classifying normal 
data were discussed and an example representing classifica- 
tion of multispectral agricultural data was given. For 
studies of pair-wise separability measures involving normal 
patterns, methods of selecting statistical parameters ef- 
ficiently were given. An example depicting the relationship 
between the Chernoff bound and correct recognition was pre- 
sented. The results suggest the possibility of the existence 
of a tight lower bound on error in terms of the Chernoff 
distance for normal patterns. 

APPENDIX A

Let Z be $N(M_j, K_j)$. Then $X = \Lambda_i^{-1/2} Q_i^t (Z - M_i)$ has mean
vector

$$M_j^* = E[\Lambda_i^{-1/2} Q_i^t (Z - M_i)] = \Lambda_i^{-1/2} Q_i^t (EZ - M_i) = \Lambda_i^{-1/2} Q_i^t (M_j - M_i) \tag{A1}$$

and covariance matrix

$$
\begin{aligned}
K_j^* &= E\{[\Lambda_i^{-1/2} Q_i^t (Z - M_i) - \Lambda_i^{-1/2} Q_i^t (M_j - M_i)][\text{same}]^t\} \\
&= \Lambda_i^{-1/2} Q_i^t \, [E (Z - M_j)(Z - M_j)^t] \, Q_i \Lambda_i^{-1/2} \\
&= \Lambda_i^{-1/2} Q_i^t K_j Q_i \Lambda_i^{-1/2}.
\end{aligned} \tag{A2}
$$

Thus classifying $Z \sim N(M_j, K_j)$ is equivalent to classifying

$$X \sim N[\Lambda_i^{-1/2} Q_i^t (M_j - M_i),\ \Lambda_i^{-1/2} Q_i^t K_j Q_i \Lambda_i^{-1/2}] = N(M_j^*, K_j^*),$$

which for class i is N(0,I). In fact if we define the discriminant
for class j at X as

$$g_j^*(X) = C_j + \log|K_j^*| + (X - M_j^*)^t K_j^{*-1} (X - M_j^*) \tag{A3}$$

where $C_j$ is the cost and a priori probability constant, we
find that substitution of (A1) and (A2) yields

$$
\begin{aligned}
g_j^*(X) &= C_j + \log|K_j| - \log|K_i| + (Z - M_j)^t K_j^{-1} (Z - M_j) \\
&= g_j(Z) - \log|K_i|.
\end{aligned} \tag{A4}
$$

Thus the discriminant values differ by a constant.


Figure 1: Conservative Confidence Values for Selective Sampling (axis label: total number of samples N).

Figure 2: Confidence Values for Selective Sampling Using the Normal Assumption.

Figure 4: Simulation Results for Normal Data.

Figure 6: Probability of Correct Classification versus Chernoff Bound.



